Skip to content

During PlannedReparentShard, use timeouts specified by -wait_slave_timeout for PromoteSlaveWhenCaughtUp#4850

Merged
deepthi merged 1 commit intovitessio:masterfrom
dasl-:wait_slave_timeout
May 1, 2019
Merged

During PlannedReparentShard, use timeouts specified by -wait_slave_timeout for PromoteSlaveWhenCaughtUp#4850
deepthi merged 1 commit intovitessio:masterfrom
dasl-:wait_slave_timeout

Conversation

@dasl-
Copy link
Copy Markdown
Member

@dasl- dasl- commented Apr 30, 2019

Slack discussion: https://vitess.slack.com/archives/C0PQY0PTK/p1556652127032000

I see that during PlannedReparentShard, there is a wait_slave_timeout param:

-wait_slave_timeout duration
time to wait for slaves to catch up in reparenting (default 30s)

waitSlaveTimeout appears to be used for:

  1. searching for best new master candidate (which has executed most transactions):
    masterElectTabletAlias, err = wr.chooseNewMaster(ctx, shardInfo, tabletMap, avoidMasterTabletAlias, waitSlaveTimeout)
  2. reparenting all replicas to the new master after the new master has been enabled:
    replCtx, replCancel := context.WithTimeout(ctx, waitSlaveTimeout)

It is not used for waiting for replication to catch up on the new master:

remoteCtx, remoteCancel = context.WithTimeout(ctx, *topo.RemoteOperationTimeout)
defer remoteCancel()
// Wait on the master-elect tablet until it reaches that position,
// then promote it
wr.logger.Infof("promote slave %v", masterElectTabletAliasStr)
event.DispatchUpdate(ev, "promoting slave")
rp, err = wr.tmc.PromoteSlaveWhenCaughtUp(remoteCtx, masterElectTabletInfo.Tablet, rp)

That uses RemoteOperationTimeout which defaults to 30s:

RemoteOperationTimeout = flag.Duration("remote_operation_timeout", 30*time.Second, "time to wait for a remote operation")

Based on the flag description of the wait_slave_timeout flag, I'd expect it to be used for waiting for replication to catch up.

Signed-off-by: dleibovic dleibovic@etsy.com

@dasl- dasl- requested a review from sougou as a code owner April 30, 2019 21:14
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DemoteMaster and UndoDemoteMaster should not be using waitSlaveTimeout.
Also, default waitSlaveTimeout to topo.RemoteOperationTimeout at https://github.com/vitessio/vitess/blob/master/go/vt/vtctl/reparent.go#L103

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, i've updated.

…meout for PromoteSlaveWhenCaughtUp

Signed-off-by: dleibovic <dleibovic@etsy.com>
@dasl- dasl- force-pushed the wait_slave_timeout branch from 5f9a247 to baa15c2 Compare May 1, 2019 00:36
@dasl- dasl- changed the title During PlannedReparentShard, use timeouts specified by -wait_slave_timeout for all operations During PlannedReparentShard, use timeouts specified by -wait_slave_timeout for PromoteSlaveWhenCaughtUp May 1, 2019
Copy link
Copy Markdown
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@deepthi deepthi merged commit 16492b1 into vitessio:master May 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants